The Development, Deployment, and Evaluation of the CLEFT-Q Computerized Adaptive Test: A Multimethods Approach Contributing to Personalized, Person-Centered Health Assessments in Plastic Surgery

Background: Routine use of patient-reported outcome measures (PROMs) and computerized adaptive tests (CATs) may improve care in a range of surgical conditions. However, most available CATs are neither condition-specific nor coproduced with patients, and they lack clinically relevant score interpretation. Recently, a PROM called the CLEFT-Q has been developed for use in the treatment of cleft lip or palate (CL/P), but the assessment burden may be limiting its uptake into clinical practice.
Objective: We aimed to develop a CAT for the CLEFT-Q, which could facilitate the uptake of the CLEFT-Q PROM internationally. We aimed to conduct this work with a novel patient-centered approach and make source code available as an open-source framework for CAT development in other surgical conditions.
Methods: CATs were developed using Rasch measurement theory, with full-length CLEFT-Q responses collected during the CLEFT-Q field test (this included 2434 patients across 12 countries). These algorithms were validated in Monte Carlo simulations involving full-length CLEFT-Q responses collected from 536 patients. In these simulations, the CAT algorithms approximated full-length CLEFT-Q scores iteratively, using progressively fewer items from the full-length PROM. Agreement between full-length CLEFT-Q score and CAT score at different assessment lengths was measured using the Pearson correlation coefficient, root-mean-square error (RMSE), and 95% limits of agreement. CAT settings, including the number of items to be included in the final assessments, were determined in a multistakeholder workshop that included patients and health care professionals. A user interface was developed for the platform, and it was prospectively piloted in the United Kingdom and the Netherlands. Interviews were conducted with 6 patients and 4 clinicians to explore end-user experience.
Results: The length of all 8 CLEFT-Q scales in the International Consortium for Health Outcomes Measurement (ICHOM) Standard Set combined was reduced from 76 to 59 items, and at this length, CAT assessments reproduced full-length CLEFT-Q scores accurately (with correlations between full-length CLEFT-Q score and CAT score exceeding 0.97, and the RMSE ranging from 2 to 5 out of 100). Workshop stakeholders considered this the optimal balance between accuracy and assessment burden. The platform was perceived to improve clinical communication and facilitate shared decision-making.
Conclusions: Our platform is likely to facilitate routine CLEFT-Q uptake, and this may have a positive impact on clinical care. Our free source code enables other researchers to rapidly and economically reproduce this work for other PROMs.

Rasch models were developed from calibration dataset responses following listwise exclusion of participants with incomplete response sets. To generate Rasch models, we used a fixed-quadrature expectation-maximization (EM) algorithm [1]. Before generating Rasch models, the two middle response options in the Speech Distress and Speech Function scales were collapsed, and scoring was reversed to match the current version of the CLEFT-Q [2].
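
The preprocessing described above (listwise exclusion, then collapsing the two middle response options and reversing the scoring) can be illustrated with a minimal sketch. This is not the authors' code. It assumes, purely for illustration, that the raw speech items have four ordered options coded 1 to 4 and that a missing response is stored as `None`; the `RECODE` map and the function names are hypothetical.

```python
# Assumption for illustration only: raw options are coded 1-4.
# Collapsing the two middle options and reversing the direction
# leaves three ordered categories scored 2, 1, 0.
RECODE = {1: 2, 2: 1, 3: 1, 4: 0}


def listwise_complete(rows):
    """Keep only respondents who answered every item (None = missing)."""
    return [list(r) for r in rows if all(v is not None for v in r)]


def collapse_and_reverse(rows):
    """Apply the hypothetical recoding map to every response."""
    return [[RECODE[v] for v in r] for r in rows]


raw = [
    [1, 2, 4],
    [3, None, 2],  # incomplete response set, so this respondent is excluded
    [4, 4, 1],
]
complete = listwise_complete(raw)
recoded = collapse_and_reverse(complete)
print(recoded)  # [[2, 1, 0], [0, 0, 2]]
```

The recoded matrix is the kind of input a Rasch calibration routine would then receive; the calibration itself (the EM algorithm) is not shown here.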

Supplementary Methods: Missing Data and Outliers in the Validation Dataset
Subjects who declined to answer any CLEFT-Q items were not included in the validation dataset. For the included participants, there were missing responses to items in the Speech Distress scale and Social Function scale. In the Speech Distress scale, missing responses existed for one participant who did not answer nine of the ten items. In the Social Function scale, there were 141 missing responses to item nine, which was added to the scale during data collection. There were also nine missing responses to item seven in the Social Function scale, and all of these occurred at the same study center.
Missing responses were handled through listwise exclusion of respondents with incomplete response sets. For the Social Function scale, we repeated the analysis including all participants, and imputed the 150 missing responses using one iteration of multiple imputation by chained equations and a proportional odds model [3]. Each analysis was performed with and without outliers, who were identified by Mahalanobis distance [4]. Hereafter we describe these repeat analyses as sensitivity-type analyses.
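
As a rough illustration of outlier screening by Mahalanobis distance, the sketch below computes the squared distance of each respondent's response vector from the sample mean. It is a sketch under stated assumptions, not the authors' procedure: the source does not say which variables the distance was computed on or which cutoff was used, so the use of raw item responses, the example data, and the `cutoff` value are all hypothetical.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows, mu):
    """Sample covariance matrix (denominator n - 1)."""
    n, p = len(rows), len(mu)
    cov = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [x - m for x, m in zip(r, mu)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[pivot][c]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of each row from the sample mean."""
    mu = mean_vector(rows)
    cov = covariance(rows, mu)
    out = []
    for r in rows:
        d = [x - m for x, m in zip(r, mu)]
        z = solve(cov, d)  # z = cov^-1 @ d, without forming the inverse
        out.append(sum(di * zi for di, zi in zip(d, z)))
    return out


# Hypothetical two-item response vectors for eight respondents.
rows = [[0, 0], [1, 1], [2, 2], [0, 1], [1, 2], [2, 1], [1, 0], [2, 0]]
d2 = mahalanobis_sq(rows)
cutoff = 2.5  # hypothetical threshold, chosen only for this toy example
flagged = [i for i, d in enumerate(d2) if d > cutoff]
print(flagged)
```

In practice a cutoff is often taken from a chi-square quantile with as many degrees of freedom as there are variables; whether that was done here is not stated in the source.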

Supplementary Methods: Computerized Adaptive Test Simulation Settings
To perform these computerized adaptive test (CAT) simulations, we developed an R package called cleftqCATsim, which contains 15 functions that allow readers to recreate these experiments with their own data.
The key CAT simulation functions serve as convenience wrappers for Phil Chalmers' mirtCAT package [5]. We have made cleftqCATsim available through GitHub with an illustrative vignette [6].
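
To make the simulation logic concrete, the following is a minimal, self-contained sketch of one simulated CAT administration using expected a posteriori (EAP) scoring and item selection by minimum expected posterior variance, the settings used in these simulations. It is a plain-Python illustration of the idea and does not reproduce the cleftqCATsim or mirtCAT interfaces. It assumes a partial credit form of the Rasch model, a standard normal prior, and a fixed-length stopping rule; the item thresholds, the response pattern, and all function names are hypothetical.

```python
import math

GRID = [-6 + 0.1 * i for i in range(121)]        # quadrature points (logits)
PRIOR = [math.exp(-0.5 * t * t) for t in GRID]   # unnormalized N(0, 1) prior


def category_probs(theta, thresholds):
    """Partial credit model: probabilities of scoring 0..len(thresholds)."""
    logits, total = [0.0], 0.0
    for d in thresholds:
        total += theta - d
        logits.append(total)
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    s = sum(weights)
    return [w / s for w in weights]


def normalize(w):
    s = sum(w)
    return [v / s for v in w]


def moments(post):
    """Posterior mean (the EAP estimate) and posterior variance."""
    mean = sum(p * t for p, t in zip(post, GRID))
    var = sum(p * (t - mean) ** 2 for p, t in zip(post, GRID))
    return mean, var


def update(post, thresholds, response):
    """Multiply the posterior by the likelihood of one observed response."""
    return normalize([p * category_probs(t, thresholds)[response]
                      for p, t in zip(post, GRID)])


def expected_posterior_variance(post, thresholds):
    """Average the posterior variance over the item's possible responses."""
    epv = 0.0
    for k in range(len(thresholds) + 1):
        joint = [p * category_probs(t, thresholds)[k] for p, t in zip(post, GRID)]
        p_k = sum(joint)  # predictive probability of response k
        if p_k > 0:
            epv += p_k * moments(normalize(joint))[1]
    return epv


def simulate_cat(items, responses, n_items):
    """Replay a full-length response set through a fixed-length CAT."""
    post = normalize(PRIOR)
    remaining = list(range(len(items)))
    administered = []
    for _ in range(n_items):
        best = min(remaining,
                   key=lambda i: expected_posterior_variance(post, items[i]))
        remaining.remove(best)
        administered.append(best)
        post = update(post, items[best], responses[best])
    theta, var = moments(post)
    return theta, math.sqrt(var), administered


# Hypothetical thresholds for five three-category items and one respondent.
items = [[-2.0, -1.0], [-0.5, 0.5], [0.0, 1.0], [1.0, 2.0], [-1.5, 0.0]]
responses = [2, 2, 1, 0, 2]
short = simulate_cat(items, responses, 3)   # CAT stopped after 3 items
full = simulate_cat(items, responses, 5)    # all items administered
print(short, full)
```

Comparing `short[0]` with `full[0]` across many respondents, at each candidate assessment length, gives the kind of agreement data that the simulations summarize.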
In the CAT simulations, factor scores were calculated for each validation dataset respondent with an expected a posteriori approach. Items were selected based on minimum expected posterior variance.

Methodological orientation and theory: Grounded theory.

10. Sampling: Patient-participants were purposively selected for diversity in age, gender, ethnicity, and diagnosis. Clinician-participants were purposively selected for diversity in occupation.

11. Method of approach: Participants were approached following routine clinic appointments. All participants had recent (<7 days) experience of using the CLEFT-Q computerized adaptive test.

12. Non-participation: Two patient-participants were unable to attend interviews within 7 days due to logistic and time constraints.

Setting

14. Setting of data collection: Data were collected either over videoconferencing software in the participant's home, or in the clinical environment following an appointment.

15. Presence of non-participants: For participants aged <18 years, an adult with parental responsibility was present during the interview.

16. Description of sample: Sample demographics are presented in Supplementary Table 4.

17. Interview guide: Interview schedules were piloted with one of the interviewer's doctoral supervisors. These are presented in Supplementary Table 2 and Supplementary Table 3.

Supplementary Table 2
Interview schedule for patient-participants.

Focus area | Opening question and examples of additional probes

Introduction
Tell me about your last visit to see the cleft team.

Is the CLEFT-Q CAT a worthwhile adjunct to clinical practice?
What did you think of the CLEFT-Q CAT questionnaire?
Do you think it changed anything about your conversation with the cleft team?
What did it change?
Did it change anything else?
Did you like completing it?
Do you think it's a good idea to ask other people to complete the CLEFT-Q CAT at their appointments, just like you did?
Why do you think that?

How burdensome is the CLEFT-Q CAT?
How difficult was it to complete the CLEFT-Q CAT?
What was difficult about it?
Did it take a long time?
Was it boring?
What was boring about it?
Did it make you tired?
If you had the choice, would you rather do the CLEFT-Q CAT on an iPad (just like you did) or would you rather have a pen-and-paper version of the questionnaire, with slightly more questions? Why?

Facilitators and barriers to CLEFT-Q CAT implementation
Can you think of anything that might make you less likely to use the CLEFT-Q CAT?
Can you think of anything that might make you more likely to use the CLEFT-Q CAT?

Areas for CLEFT-Q CAT improvement
If you could change anything about the CLEFT-Q CAT, what would you change? Why?
Is there anything you really liked about the CLEFT-Q CAT?

Supplementary Table 3
Interview schedule for clinician-participants.

Focus area | Opening question and examples of additional probes

Introduction
Tell me about your role in the cleft team.
Have you used the CLEFT-Q CAT a lot?

Is the CLEFT-Q CAT a worthwhile adjunct to clinical practice?
What do you think of the CLEFT-Q CAT questionnaire?
Do you think it has changed any aspect of your clinical care, or that of your colleagues?
What has it changed?
Has it changed anything else?
Do you think patients like completing it?
How useful is it as an adjunct to clinical care?
Do you think other cleft teams should be using it?
Why do you think that?

How burdensome is the CLEFT-Q CAT?
How burdensome is the CLEFT-Q CAT, from your perspective?
Has it changed your workload, or that of your colleagues?
In what way?
Does it make clinics faster or slower?
Have patients given you feedback about the burden of completing it?

Facilitators and barriers to CLEFT-Q CAT implementation
Can you think of any barriers to cleft teams implementing the CLEFT-Q CAT?
Can you think of anything that made it easier or more difficult to implement?
What advice would you give other cleft teams that are thinking about using the CLEFT-Q CAT?

Areas for CLEFT-Q CAT improvement
If you could change anything about the CLEFT-Q CAT, what would you change? Why?
Is there anything you really like about the CLEFT-Q CAT?

Supplementary Results: Missing Data in the Calibration Dataset
An analysis of missing items was performed for the calibration dataset and is presented in Sheet 1 of Multimedia Appendix 2. Missing item responses were largely missing at random (explainable by other variables). For example, 84% (837/991) of participants missing one or more Jaw scale items were under the age of 12 years. This is because only CLEFT-Q field test participants aged 12-29 years were asked to complete Jaw scale items [2]. Similarly, 50% (307/614) of participants missing one or more School scale items were not attending school (and were therefore not administered these items in the CLEFT-Q field test). In the calibration sample, 43% (263/615) of participants missing Speech Distress items and 39% (263/670) of those missing Speech Function items were born with a cleft lip only, and were therefore unlikely to use these scales in a real-world setting.

Supplementary Results: Rasch Parameterization
Rasch model parameters and fit statistics are presented in Sheet 2 of Multimedia Appendix 2.

Supplementary Results: Missing Data and Outliers in the Validation Dataset
The proportions of outliers for each scale are presented in Sheet 3 of Multimedia Appendix 2.

Supplementary Results: Computerized Adaptive Test Simulation Settings
Full results from the computerized adaptive test (CAT) simulations are presented in Sheet 4 of Multimedia Appendix 2. In this sheet, root-mean-square error (RMSE) and 95% limits of agreement are presented as person-location logits, and median values for standard error of measurement are presented for each assessment, with their interquartile ranges. This includes all sensitivity-type analyses. In Sheet 5 of Multimedia Appendix 2, these results are presented as transformed (0-100) CLEFT-Q scores.
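
For readers who wish to compute the same agreement statistics on their own data, the sketch below shows one way to obtain the Pearson correlation coefficient, the RMSE, and 95% limits of agreement between full-length scores and CAT scores. It assumes the limits of agreement follow the usual Bland-Altman form (mean difference plus or minus 1.96 standard deviations of the differences), which the source does not spell out. The example scores are invented, and the same functions apply whether scores are expressed in logits or on the 0-100 scale.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient between paired scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(x, y):
    """Root-mean-square error between paired scores."""
    return math.sqrt(mean([(a - b) ** 2 for a, b in zip(x, y)]))


def limits_of_agreement(x, y):
    """Assumed Bland-Altman form: mean difference +/- 1.96 SD of differences."""
    diffs = [b - a for a, b in zip(x, y)]
    bias, sd = mean(diffs), stdev(diffs)
    return bias - 1.96 * sd, bias + 1.96 * sd


# Invented 0-100 scores for five respondents.
full_scores = [40.0, 55.0, 62.0, 71.0, 88.0]
cat_scores = [42.0, 53.0, 65.0, 70.0, 85.0]

r = pearson_r(full_scores, cat_scores)
error = rmse(full_scores, cat_scores)
lower, upper = limits_of_agreement(full_scores, cat_scores)
print(round(r, 3), round(error, 2), round(lower, 2), round(upper, 2))
```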

Supplementary Results: Multistakeholder Workshop
Voting results for stopping rules (CAT assessment lengths) at the multistakeholder workshop are presented in Sheet 6 of Multimedia Appendix 2.

Supplementary Results: Concerto Front-End
The patient-facing front-end of the Concerto-based CLEFT-Q CAT app is illustrated in Supplementary Figure 1 and Supplementary Figure 2. Supplementary Figure 1 shows the CLEFT-Q CAT launcher, where relevant scales for the patient can be selected, and Supplementary Figure 2 shows an example item.

Supplementary Figure 1
The CLEFT-Q computerized adaptive test launcher.

Supplementary Figure 2
An example item from the CLEFT-Q computerized adaptive test.

Supplementary Results: Interview Participants
Characteristics of interview participants are displayed in Supplementary Table 4.